We start from the demultiplexed SingleCellExperiment object created in ‘Demultiplexing the Cooney (C057) memory CD4 T-cell’.
Cell-based annotations are included in the colData of the SingleCellExperiment. We store the counts from the hashtag features in the colData of the SingleCellExperiment.
This dataset contains samples from 7 samples. All samples are human memory CD4+ T cells from ‘humanised mice’. Humanised mice are an immunodeficient strain and injected with human stem cells as pups which then differentiate into human hematopoietic cells as the mice mature. Mice were infected or not infected with HTLV, with each condition done in biological triplicate (1 mouse / replicate).
Figure 1: Breakdown of the samples
Figure 2: Breakdown of the two treatments.
Having quantified gene expression against the Ensembl gene annotation, we have Ensembl-style identifiers for the genes. These identifiers are used as they are unambiguous and highly stable. However, they are difficult to interpret compared to the gene symbols which are more commonly used in the literature. Given the Ensembl identifiers, we obtain the corresponding gene symbols using annotation packages available through Bioconductor. Henceforth, we will use gene symbols (where available) to refer to genes in our analysis and otherwise use the Ensembl-style gene identifiers1.
Low-quality cells need to be removed to ensure that technical effects do not distort downstream analysis results. We use several quality control (QC) metrics to measure the quality of the cells:
log10_total_counts: This measures the library size of the cells, which is the total sum of counts across both genes and spike-in transcripts. We want cells to have high library sizes as this means more RNA has been successfully captured during library preparation.total_features_by_counts: This is the number of expressed features2 in each cell. Cells with few expressed features are likely to be of poor quality, as the diverse transcript population has not been successfull captured.pct_counts_Mt: This measures the proportion of reads which are mapped to mitochondrial RNA. If there is a higher than expected proportion of mitochondrial RNA this is often symptomatic of a cell which is under stress and is therefore of low quality and will not be used for the analysis.In summary, we aim to remove cells with low library sizes, few expressed genes, and very high percentages of spike-in transcripts or mitochondrial DNA.
The distributions of these metrics are shown in Figure 3, stratified by sample
Figure 3: Distributions of various QC metrics for all cells in the dataset. This includes the library sizes, number of expressed genes, and proportion of reads mapped to spike-in transcripts or mitochondrial genes.
Figure 3 shows that the vast majority of samples are good-quality; the library sizes are in the thousands3, we observe around two thousand expressed genes per cell, and moderate mitochondrial percentages per cell.
It is notable that the Unknown samples have much smaller library sizes and fewer genes detected than the other samples.
TODO: What is the interpretation of the above result?
It is also valuable to examine how the other QC metrics behave with respect to each other (Figure 4). Generally, they will be in rough agreement, i.e., cells with low total counts will also have low numbers of expressed features and mitochondrial proportions. Clear discrepancies may correspond to technical differences between batches of cells or genuine biological differences in RNA content.
Figure 4: Behaviour of each QC metric compared to the total number of expressed features. Each point represents a cell in the data set.
For now, we retain the samples of Unknown origin because it may be interesting to see if these are from specific cell types.
Outliers are defined based on the median absolute deviation (MADs) from the median value of each metric across all cells. We remove small and large outliers for the library size and the number of expressed features, and large outliers for the spike-in proportions. Removal of low-quality cells is then performed by combining the filters for all of the metrics.
In order to avoid removing all the Unknown cells, we identify and remove outliers using sample-specific metrics.
The following table summarises the QC cutoffs:
| Sample | total counts | total features | %mito |
|---|---|---|---|
| infected_1 | 1359.7 | 861.0 | 11.2 |
| infected_2 | 1214.9 | 786.5 | 10.0 |
| infected_3 | 1190.2 | 796.5 | 10.8 |
| uninfected_1 | 1214.2 | 798.8 | 10.4 |
| uninfected_2 | 1215.9 | 822.5 | 10.2 |
| uninfected_3 | 1000.2 | 710.8 | 9.9 |
| Unknown | 343.5 | 280.2 | 9.3 |
The vast majority of cells are retained for all samples.
| Sample | ByLibSize | ByFeature | ByMito | Remaining | PercRemaining |
|---|---|---|---|---|---|
| infected_1 | 59 | 66 | 81 | 1529 | 92.2 |
| infected_2 | 50 | 62 | 66 | 1314 | 92.8 |
| infected_3 | 44 | 61 | 54 | 1417 | 93.5 |
| uninfected_1 | 70 | 100 | 112 | 1985 | 92.2 |
| uninfected_2 | 86 | 113 | 112 | 2298 | 92.5 |
| uninfected_3 | 106 | 160 | 85 | 1513 | 87.6 |
| Unknown | 54 | 60 | 59 | 3384 | 96.7 |
Finally, we look at the QC metrics for the cells remaining after outlier removal. Figure 5 shows the samples have similar QC metrics and the extreme values have been removed, as is to be expected following quality control.
Figure 5: Distributions of various QC metrics for all cells in the data set. This includes the library sizes, number of expressed genes, and proportion of reads mapped to spike-in transcripts or mitochondrial genes.
Some basic summaries of the quality-controlled data are shown in the following table. The quality controlled data maintain a reasonable representation of each cytokine type.
| Sample | number_remaining | percent_remaining |
|---|---|---|
| infected_1 | 1529 | 92 |
| infected_2 | 1314 | 93 |
| infected_3 | 1417 | 94 |
| uninfected_1 | 1985 | 92 |
| uninfected_2 | 2298 | 93 |
| uninfected_3 | 1513 | 88 |
| Unknown | 3384 | 97 |
Figure 6 shows this visually.
Figure 6: Breakdown of sample types after QC.
Figure 7 shows the most highly expressed genes across the cell population in the combined data set. Some of them are mitochondrial genes, matching what we’ve already seen in the QC metrics, and ribosomal protein genes, which commonly are amongst the most highly expressed genes in scRNA-seq data.
Figure 7: Percentage of total counts assigned to the top 50 most highly-abundant features in the combined data set. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labelled as a control feature.
Figure 8 shows the most highly expressed endogeneous genes after excluding the mitochondrial and ribosomal protein genes.
Figure 8: Percentage of total counts assigned to the top 50 most highly-abundant features in the combined data set. For each feature, each bar represents the percentage assigned to that feature for a single cell, while the circle represents the average across all cells. Bars are coloured by the total number of expressed features in each cell, while circles are coloured according to whether the feature is labelled as a control feature.
Reassuringly, the most highly expressed genes are those we might expect to see in T cells (e.g., ACTB, B2M, TMSB4X).
Low-abundance genes are problematic as zero or near-zero counts do not contain much information for reliable statistical inference (Bourgon, Gentleman, and Huber 2010). These genes typically do not provide enough evidence to reject the null hypothesis during testing, yet they still increase the severity of the multiple testing correction. In addition, the discreteness of the counts may interfere with statistical procedures, e.g., by compromising the accuracy of continuous approximations. Thus, low-abundance genes are often removed in many RNA-seq analysis pipelines before the application of downstream methods.
The ‘optimal’ choice of filtering strategy depends on the downstream application. A more aggressive filter is usually required to remove discreteness (e.g., for normalization) compared to that required for removing underpowered tests. For hypothesis testing, the filter statistic should also be independent of the test statistic under the null hypothesis. Thus, we (or the relevant function) will filter at each step as needed, rather than applying a single filter for the entire analysis.
Several metrics can be used to define low-abundance genes. The most obvious is the average count for each gene, computed across all cells in the data set. We typically observe a peak of moderately expressed genes following a plateau of lowly expressed genes (Figure 9).
Figure 9: Histogram of log-average counts for all genes in the combined data set.
We remove 12951 genes that are not expressed in any cell. Such genes provide no information and would be removed by any filtering strategy. We retain 20587 for downstream analysis.
Read counts are subject to differences in capture efficiency and sequencing depth between cells (Stegle, Teichmann, and Marioni 2015). Normalization is required to eliminate these cell-specific biases prior to downstream quantitative analyses. This is often done by assuming that most genes are not differentially expressed (DE) between cells. Any systematic difference in count size across the non-DE majority of genes between two cells is assumed to represent bias and is removed by scaling. More specifically, ‘size factors’ are calculated that represent the extent to which counts should be scaled in each library.
Size factors can be computed with several different approaches, e.g., using the estimateSizeFactorsFromMatrix function in the DESeq2 package (Anders and Huber 2010; Love, Huber, and Anders 2014), or with the calcNormFactors function (Robinson and Oshlack 2010) in the edgeR package. However, single-cell data can be problematic for these bulk data-based methods due to the dominance of low and zero counts. To overcome this, we pool counts from many cells to increase the count size for accurate size factor estimation (Lun, Bach, and Marioni 2016). Pool-based size factors are then ‘deconvolved’ into cell-based factors for cell-specific normalization. This removes scaling biases associated with cell-specific differences in capture efficiency, sequencing depth and composition biases.
We check that the size factors are roughly aligned with the total library sizes (Figure 10). Deviations from the diagonal correspond to composition biases due to differential expression between cell subpopulations.
Figure 10: Size factors from deconvolution, plotted against library sizes for all cells in each dataset. Axes are shown on a log-scale.
The count data are used to compute normalized log-expression values for use in downstream analyses. Each value is defined as the log2-ratio of each count to the size factor for the corresponding cell, after adding a prior count of 1 to avoid undefined values at zero counts. Division by the size factor ensures that any cell-specific biases are removed.
The processed SingleCellExperiment object is available (see data/SCEs/C057_Cooney.processed.SCE.rds). This will be used in downstream analyses, e.g. when merging the different samples onto the same coordinate system.
The following are available on request:
The analysis and this document were prepared using the following software:
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.3 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
[5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats4 stats graphics grDevices utils
[7] datasets methods base
other attached packages:
[1] scran_1.12.1 ensembldb_2.8.1
[3] AnnotationFilter_1.8.0 GenomicFeatures_1.36.4
[5] AnnotationDbi_1.46.1 AnnotationHub_2.16.1
[7] BiocFileCache_1.8.0 dbplyr_1.4.2
[9] scater_1.12.2 cowplot_1.0.0
[11] ggplot2_3.2.1 BiocStyle_2.12.0
[13] janitor_1.2.0 rmarkdown_1.16
[15] here_0.1 SingleCellExperiment_1.6.0
[17] SummarizedExperiment_1.14.1 DelayedArray_0.10.0
[19] BiocParallel_1.18.1 matrixStats_0.55.0
[21] Biobase_2.44.0 GenomicRanges_1.36.1
[23] GenomeInfoDb_1.20.0 IRanges_2.18.3
[25] S4Vectors_0.22.1 BiocGenerics_0.30.0
loaded via a namespace (and not attached):
[1] ggbeeswarm_0.6.0 colorspace_1.4-1
[3] dynamicTreeCut_1.63-1 rprojroot_1.3-2
[5] XVector_0.24.0 BiocNeighbors_1.3.3
[7] bit64_0.9-7 interactiveDisplayBase_1.22.0
[9] knitr_1.25 zeallot_0.1.0
[11] colorblindr_0.1.0 Rsamtools_2.0.3
[13] shiny_1.4.0 BiocManager_1.30.8
[15] compiler_3.6.1 httr_1.4.1
[17] dqrng_0.2.1 backports_1.1.5
[19] assertthat_0.2.1 Matrix_1.2-17
[21] fastmap_1.0.1 lazyeval_0.2.2
[23] limma_3.40.6 later_1.0.0
[25] BiocSingular_1.0.0 htmltools_0.4.0
[27] prettyunits_1.0.2 tools_3.6.1
[29] igraph_1.2.4.1 rsvd_1.0.2
[31] gtable_0.3.0 glue_1.3.1
[33] GenomeInfoDbData_1.2.1 dplyr_0.8.3
[35] rappdirs_0.3.1 Rcpp_1.0.2
[37] vctrs_0.2.0 Biostrings_2.52.0
[39] rtracklayer_1.44.4 DelayedMatrixStats_1.6.1
[41] xfun_0.10 stringr_1.4.0
[43] mime_0.7 irlba_2.3.3
[45] statmod_1.4.32 XML_3.98-1.20
[47] edgeR_3.26.8 zlibbioc_1.30.0
[49] scales_1.0.0 hms_0.5.1
[51] promises_1.1.0 ProtGenerics_1.16.0
[53] yaml_2.2.0 curl_4.2
[55] memoise_1.1.0 gridExtra_2.3
[57] biomaRt_2.40.5 distill_0.7
[59] stringi_1.4.3 RSQLite_2.1.2
[61] highr_0.8 rlang_0.4.0
[63] pkgconfig_2.0.3 bitops_1.0-6
[65] evaluate_0.14 lattice_0.20-38
[67] purrr_0.3.2 GenomicAlignments_1.20.1
[69] labeling_0.3 bit_1.1-14
[71] tidyselect_0.2.5 magrittr_1.5
[73] R6_2.4.0 DBI_1.0.0
[75] pillar_1.4.2 withr_2.1.2
[77] RCurl_1.95-4.12 tibble_2.1.3
[79] crayon_1.3.4 viridis_0.5.1
[81] progress_1.2.2 locfit_1.5-9.1
[83] grid_3.6.1 blob_1.2.0
[85] digest_0.6.21 xtable_1.8-4
[87] httpuv_1.5.2 munsell_0.5.0
[89] beeswarm_0.2.3 viridisLite_0.3.0
[91] vipor_0.4.5
Anders, S., and W. Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biol. 11 (10): R106. https://www.ncbi.nlm.nih.gov/pubmed/20979621.
Bourgon, R., R. Gentleman, and W. Huber. 2010. “Independent Filtering Increases Detection Power for High-Throughput Experiments.” Proc. Natl. Acad. Sci. U.S.A. 107 (21): 9546–51. https://www.ncbi.nlm.nih.gov/pubmed/20460310.
Love, M. I., W. Huber, and S. Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for Rna-Seq Data with Deseq2.” Genome Biol. 15 (12): 550. https://www.ncbi.nlm.nih.gov/pubmed/25516281.
Lun, A. T., K. Bach, and J. C. Marioni. 2016. Genome Biol. 17 (April): 75. https://www.ncbi.nlm.nih.gov/pubmed/27122128.
Robinson, M. D., and A. Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of Rna-Seq Data.” Genome Biol. 11 (3): R25. https://www.ncbi.nlm.nih.gov/pubmed/20196867.
Stegle, O., S. A. Teichmann, and J. C. Marioni. 2015. “Computational and Analytical Challenges in Single-Cell Transcriptomics.” Nat. Rev. Genet. 16 (3): 133–45. https://www.ncbi.nlm.nih.gov/pubmed/25628217.
Some care is taken to account for missing and duplicate gene symbols; missing symbols are replaced with the Ensembl identifier and duplicated symbols are concatenated with the (unique) Ensembl identifiers.↩
The number of expressed genes refers to the number of genes which have non-zero counts (ie. they have been identified in the cell at least once)↩
This is consistent with the use of UMI counts rather than read counts, as each transcript molecule can only produce one UMI count but can yield many reads after fragmentation.↩